Method for clustering sequences in groups

ABSTRACT

To cluster sequences into biological groups, conventional database search programs are called iteratively in order to collect the sequences related to a given protein sequence. The inventive method enables the fully automatic distribution of a large number of protein sequences into groups. Most of these groups are disjoint from one another, so that they represent a meaningful and valid grouping of the data.

This invention concerns a method of grouping sequences in families.

Large quantities of protein sequence data are generated today in molecular biology. A major problem here is how to group such protein sequence data logically into biological families. Because families are not exactly defined, and the diversity of different gene families varies, this data-grouping problem is far from trivial.

In the past, biological information could only assist human experts, who would thoroughly examine the output of database searching programs and create a grouping according to families. This method is time-consuming, labor-intensive and not very reproducible.

Therefore, the object of this invention is to find a method with which a large set of protein sequences can be divided into groups fully automatically.

This object is achieved with the features of Patent Claim 1.

The method described here is based on the finding that rapid grouping can be achieved when traditional database searching programs are run iteratively to find a quantity of sequences related to a given protein sequence.

It is advantageous to carry out the method described here for each sequence in the database; to remove clusters that occur repeatedly, except for one cluster each; to remove clusters that are contained in other clusters; and, of the remaining quantity of clusters, to output the clusters that do not overlap with other clusters as a partitioning of the database and to output the remaining overlapping clusters as groups whose clusters are linked together by overlapping.

In this way, a clustering of the database is achieved in which most of the clusters are disjoint from one another, and therefore a valid and reasonable clustering of the data is obtained.

This method permits a much faster and more objective analysis of new sequence data than has been possible in the past. The pairwise disjoint part of the clustering no longer requires any checking from a practical standpoint, and it forms the ideal basis for automatic annotations and more extensive analyses. The residue, i.e., the remaining portion of overlapping clusters, is the portion that must be studied by human experts. This portion is greatly reduced and is also prestructured by the overlapping clusters.

This method can be carried out in an extremely short computation time because it is no longer necessary to compare each sequence separately with every other sequence in order to cluster an entire database.

In an advantageous embodiment, the threshold value is between 10⁻²⁰ and 10⁻³⁵. In practice, a value of 10⁻³⁰ has proven suitable. In addition to clustering protein sequences, this method is also suitable for DNA sequences, where it may be appropriate to relax the threshold value beyond 10⁻²⁰.

A further refinement of the method according to this invention is that, of the positive quantity of sequences found in one iteration step, not only the sequence weighted as worst is used for another database search; instead, all sequences of this quantity serve as search sequences in additional searches. Because more searches must be performed, this alternative is not as fast as the method described originally for an individual search. In conjunction with clustering, however, this does not cause any time loss. Moreover, since the sequence space around the initial sequence is searched more thoroughly, the resulting cluster has a high probability of already including all the sequences belonging to this protein family.
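The refinement in which every sequence of the positive quantity serves as a search sequence might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `build_cluster_exhaustive` and `search` are hypothetical, `search` stands for any wrapper around a database search program that returns a mapping from sequence identifier to probability value, and the sketch assumes that a positive quantity is merged only when it intersects the positive quantity of the initial sequence, which the text does not spell out for this variant.

```python
from collections import deque
from typing import Callable, Dict, Optional, Set

# Hypothetical search interface: sequence ID -> {hit ID: probability value}.
SearchFn = Callable[[str], Dict[str, float]]


def build_cluster_exhaustive(query: str, search: SearchFn,
                             threshold: float = 1e-30) -> Set[str]:
    """Use every sequence of each positive quantity as a further search sequence."""
    cluster: Set[str] = set()
    reference: Optional[Set[str]] = None  # positive quantity of the initial sequence
    queued: Set[str] = {query}            # sequences already scheduled for a search
    pending = deque([query])
    while pending:
        current = pending.popleft()
        positive = {s for s, p in search(current).items() if p < threshold}
        if reference is None:  # first search
            reference = set(positive)
        # Assumption: ignore result sets that no longer touch the reference quantity.
        if not (positive & reference):
            continue
        cluster |= positive
        for seq in sorted(positive - queued):
            queued.add(seq)
            pending.append(seq)
    return cluster


# Toy data (invented for illustration only, not taken from a real database).
TOY_HITS = {
    "A": {"A": 0.0, "B": 1e-40},
    "B": {"B": 0.0, "A": 1e-40, "C": 1e-35},
    "C": {"C": 0.0, "B": 1e-35, "D": 1e-5},
}


def toy_search(seq_id: str) -> Dict[str, float]:
    return TOY_HITS.get(seq_id, {seq_id: 0.0})


exhaustive_result = build_cluster_exhaustive("A", toy_search)
```

In the toy data, the sequence D is never added because its probability value of 1e-5 lies above the threshold.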

The “BLASTP” program is very suitable as a database searching program. This program is described in greater detail by S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic Local Alignment Search Tool,” J. Mol. Biol., 215: 403-410, 1990.

As an alternative, however, the “FASTA” database search program may also be used as described, for example, in the following literature citation: W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proc. Natl. Acad. Sci. USA, 85: 2444-2448, 1988.

In addition to the “BLASTP” and “FASTA” database search programs, any other database search program may also be used. For example, the “gapped BLAST” program (described by S. F. Altschul, T. L. Madden, A. A. Schaeffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, 25 (17): 3389-3402, 1997) is also suitable.

Using the “BLASTP” program, the PIR1 database, release 51, has been clustered by quantitative theory. This database contains 13,489 protein sequences and is described in detail by David G. George, Richard J. Dodson, John S. Garavelli, Daniel H. Haft, Lois T. Hunt, Christopher R. Marzec, Bruce C. Orcutt, Kathryn E. Sidman, Geetha Y. Srinivasarao, Lai-Su L. Yeh, Leslie M. Arminski, Robert S. Ledley, Akira Tsugita and Winona Barker, “The Protein Information Resource (PIR) and the PIR international protein sequence database,” Nucleic Acids Research, 25 (1): 24-27, 1997. Grouping this database required approximately one day of computation time, and 91% of the database sequences were grouped into disjoint clusters in a fully automatic procedure. The residue includes only approximately 9% of the database sequences.

The SWISS-PROT database, whose 34th release contains 59,021 sequences, is a much larger database. This database is described by Amos Bairoch and Rolf Apweiler, “The SWISS-PROT protein sequence data bank and its supplement TrEMBL,” Nucleic Acids Research, 25 (1): 31-36, 1997. Within approximately five days of computation time, 80% of the sequences were classified in disjoint classes.

These examples show that extreme savings in terms of computation time are possible with the method according to this invention in comparison with traditional methods, and the cluster results are excellent.

An algorithm for the method described here is presented below:

cluster ← empty quantity
search sequence ← inquiry sequence
as long as (search sequence defined)
    search the database with the search sequence
    positive quantity ← all found sequences with a probability value below the threshold level
    search sequence ← undefined
    if (first search) then
        reference quantity ← positive quantity
    end if
    if (a sequence exists in the positive quantity that is not contained in the cluster) and
       (the intersecting quantity between the positive quantity and the reference quantity is not empty)
    then
        search sequence ← sequence weighted as worst in the positive quantity that is not contained in the cluster
        cluster ← combined quantity of cluster and positive quantity
    end if
end as long as

The sequence with which the search method begins is called the initial sequence, and the quantity of found sequences is called the cluster belonging to this initial sequence. First, a database search program such as “BLASTP” or “FASTA” is started with the initial sequence, and all sequences from the database that have a significant similarity with the initial sequence are accepted. We call this quantity of sequences a positive quantity, and we include it in the cluster as related sequences. There is a significant similarity between two sequences when the probability that this similarity occurs randomly is very low, i.e., below a given threshold. From the positive quantity thus obtained, we now use the sequence weighted as worst (i.e., the sequence having the highest probability) as a search sequence for another database search. This process is repeated as long as sequences below the threshold value are found which are not contained in the cluster and as long as there is an intersection quantity between the positive quantity of the initial sequence and the positive quantity of the current search sequence.
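The iterative search described above might be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the names `build_cluster`, `search` and `toy_search` are hypothetical, and `search` is assumed to wrap a database search program such as BLASTP and to return a mapping from sequence identifier to probability value. The toy data are invented for illustration only.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical search interface: sequence ID -> {hit ID: probability value}.
SearchFn = Callable[[str], Dict[str, float]]


def build_cluster(query: str, search: SearchFn,
                  threshold: float = 1e-30) -> Set[str]:
    """Grow the cluster of one initial sequence by iterated database searches."""
    cluster: Set[str] = set()
    reference: Optional[Set[str]] = None  # positive quantity of the initial sequence
    current: Optional[str] = query
    while current is not None:
        hits = search(current)
        # Positive quantity: all hits whose probability is below the threshold.
        positive = {s: p for s, p in hits.items() if p < threshold}
        current = None
        if reference is None:  # first search
            reference = set(positive)
        # Sequences of the positive quantity not yet contained in the cluster.
        new = {s: p for s, p in positive.items() if s not in cluster}
        if new and (reference & set(positive)):
            # Worst-weighted new sequence = highest probability value.
            current = max(new, key=new.get)
            cluster |= set(positive)
    return cluster


# Toy data (invented): each search also finds the search sequence itself.
TOY_HITS = {
    "A": {"A": 1e-100, "B": 1e-50, "C": 1e-32},
    "C": {"C": 1e-90, "A": 1e-33, "B": 1e-31, "D": 1e-31},
    "D": {"D": 1e-80, "C": 1e-31, "E": 1e-10},
}


def toy_search(seq_id: str) -> Dict[str, float]:
    return TOY_HITS.get(seq_id, {seq_id: 1e-100})


toy_cluster = build_cluster("A", toy_search)
```

Each pass through the loop that continues adds at least one new sequence to the cluster, so the loop ends after a finite number of searches.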

Then the following algorithm is carried out:

input: database
output: partitioning of the database and groups with overlapping clusters

for all (sequences in the database)
    perform the above database search method
cluster quantity ← all clusters generated
for all (identical clusters in the cluster quantity)
    remove identical clusters except for one representative
cluster quantity ← clusters without identical clusters
for all (clusters in the cluster quantity contained completely in another cluster)
    remove the smaller cluster of this cluster pair
cluster quantity ← clusters without identical clusters and without inclusions
partitioning ← all clusters in the cluster quantity that do not overlap
overlapping ← all overlapping clusters in the cluster quantity, combined in groups

To obtain database clustering, the method described above is carried out for all sequences in the database, i.e., each sequence is assigned a cluster of sequences to which it is related. From this quantity of clusters, all identical clusters are removed except for one example, because they do not contain any additional information. The remaining cluster quantity is then examined for inclusions, and clusters that are contained completely in other clusters are removed until no more inclusions are present. Of this cluster quantity, the clusters that do not overlap with others can now be regarded as a logical partitioning of the database. The remaining small number of overlapping clusters are combined in groups whose clusters are linked to one another by overlapping.
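The post-processing of the cluster quantity described above might be sketched in Python as follows. The function name `cluster_database` and the toy clusters are hypothetical, and the sketch assumes that the clusters have already been computed for every sequence. Grouping by overlap is implemented here as a simple merge of linked groups, which is one possible choice and is not prescribed by the text.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Cluster = FrozenSet[str]


def cluster_database(
    clusters: Iterable[Set[str]],
) -> Tuple[List[Cluster], List[List[Cluster]]]:
    """Return (partitioning, groups of overlapping clusters)."""
    # Remove identical clusters except for one representative.
    unique = {frozenset(c) for c in clusters}
    # Remove clusters that are contained completely in another cluster.
    maximal = sorted(
        (c for c in unique if not any(c < other for other in unique)),
        key=sorted,  # deterministic order for reproducible output
    )
    # Combine clusters that are linked by overlapping into groups.
    groups: List[List[Cluster]] = []
    for cluster in maximal:
        linked = [g for g in groups if any(cluster & member for member in g)]
        merged = [cluster]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    partitioning = [g[0] for g in groups if len(g) == 1]
    overlapping = [g for g in groups if len(g) > 1]
    return partitioning, overlapping


# Toy clusters (invented for illustration only).
toy_clusters = [{"a", "b"}, {"a", "b"}, {"a"}, {"c", "d"}, {"d", "e"}, {"f"}]
toy_partitioning, toy_overlapping = cluster_database(toy_clusters)
```

In the toy input, the duplicate of {a, b} and the included cluster {a} are dropped; {a, b} and {f} form the partitioning, while {c, d} and {d, e} remain as one group linked by the shared sequence d.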

An embodiment of this invention is described in greater detail below.

We use the sequence of the human homeobox engrailed-1 protein (HME1_HUMAN) as the inquiry sequence and then search the SWISS-PROT database (release 34) with it for related sequences. The search is performed with the BLASTP program, and we select a probability threshold of 10⁻³⁰. The result of this search looks approximately as follows (excerpts):

Sequences found:                                                 Probability:
SPR|Q05925|HME1_HUMAN  HOMEOBOX PROTEIN ENGRAILED-1 (HU-E . . .  2.4e-279
SPR|P09065|HME1_MOUSE  HOMEOBOX PROTEIN ENGRAILED-1 (MO-E . . .  4.7e-189
SPR|Q05916|HME1_CHICK  HOMEOBOX PROTEIN ENGRAILED-1 (GG-E . . .  4.6e-132
SPR|Q05917|HME2_CHICK  HOMEOBOX PROTEIN ENGRAILED-2 (GG-E . . .  3.6e-95
SPR|P19622|HME2_HUMAN  HOMEOBOX PROTEIN ENGRAILED-2 (HU-E . . .  8.0e-95
SPR|P09066|HME2_MOUSE  HOMEOBOX PROTEIN ENGRAILED-2 (MO-E . . .  5.6e-92
SPR|P09015|HME2_BRARE  HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E . . .  5.2e-70
SPR|P31538|HMEB_XENLA  HOMEOBOX PROTEIN ENGRAILED-1B (EN- . . .  1.5e-[illegible]
SPR|P52729|HMEC_XENLA  HOMEOBOX PROTEIN ENGRAILED-2A (EN- . . .  2.1e-66
SPR|P52730|HMED_XENLA  HOMEOBOX PROTEIN ENGRAILED-2B (EN- . . .  2.1e-65
SPR|P31533|HME3_BRARE  HOMEOBOX PROTEIN ENGRAILED-3 (ZF-E . . .  6.1e-64
SPR|Q04896|HME1_BRARE  HOMEOBOX PROTEIN ENGRAILED-1              1.0e-61
SPR|P09145|HMEN_DROVI  SEGMENTATION POLARITY PROTEIN ENGR . . .  9.1e-59
SPR|P05527|HMIN_DROME  INVECTED PROTEIN.                         4.5e-57
SPR|P27609|HMEN_BOMMO  SEGMENTATION POLARITY PROTEIN ENGR . . .  1.5e-55
SPR|P27610|HMIN_BOMMO  INVECTED PROTEIN.                         1.1e-52
SPR|P09532|HMEN_TRIGR  HOMEOBOX PROTEIN ENGRAILED (SU-HB . . .   4.0e-44
SPR|Q05640|HMEN_ARTSF  HOMEOBOX PROTEIN ENGRAILED.               1.7e-42
SPR|P09076|HME3_APIME  HOMEOBOX PROTEIN E30 (FRAGMENT).          2.1e-41
SPR|P09075|HME6_APIME  HOMEOBOX PROTEIN E60 (FRAGMENT).          1.0e-40
SPR|P14150|HMEN_SCHAM  HOMEOBOX PROTEIN ENGRAILED (G-EN . . .    2.3e-40
SPR|P23397|HMEN_HELTR  HOMEOBOX PROTEIN HT-EN (FRAGMENT).        1.3e-38
SPR|P31537|HMEA_XENLA  HOMEOBOX PROTEIN ENGRAILED-1A (EN- . . .  7.9e-33
SPR|P31535|HMEA_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE A . . .   1.1e-27
SPR|P34326|HM16_CAEEL  HOMEOBOX PROTEIN ENGRAILED-LIKE CE . . .  7.1e-27
SPR|P31536|HMEB_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE B . . .   5.0e-26

On the basis of the threshold at 10⁻³⁰, our cluster now contains the following sequences:

HME1_HUMAN, HME1_MOUSE, HME1_CHICK, HME2_CHICK, HME2_HUMAN, HME2_MOUSE, HME2_BRARE, HMEB_XENLA, HMEC_XENLA, HMED_XENLA, HME3_BRARE, HME1_BRARE, HMEN_DROVI, HMIN_DROME, HMEN_BOMMO, HMEN_DROME, HMIN_BOMMO, HMEN_TRIGR, HMEN_ARTSF, HME3_APIME, HME6_APIME, HMEN_SCHAM, HMEN_HELTR, HMEA_XENLA.
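The thresholding step and the choice of the next search sequence can be illustrated with a few probability values from the tail of the first search result. The following Python sketch is illustrative only; the helper names `positive_quantity` and `worst_weighted` are hypothetical, and only six of the hits are reproduced.

```python
from typing import Dict, FrozenSet, Optional

THRESHOLD = 1e-30

# Tail of the first search result (probability values as listed in the excerpt).
first_search: Dict[str, float] = {
    "HMEN_SCHAM": 2.3e-40,
    "HMEN_HELTR": 1.3e-38,
    "HMEA_XENLA": 7.9e-33,
    "HMEA_MYXGL": 1.1e-27,
    "HM16_CAEEL": 7.1e-27,
    "HMEB_MYXGL": 5.0e-26,
}


def positive_quantity(hits: Dict[str, float],
                      threshold: float = THRESHOLD) -> Dict[str, float]:
    """Keep only the hits whose probability value is below the threshold."""
    return {name: p for name, p in hits.items() if p < threshold}


def worst_weighted(positive: Dict[str, float],
                   cluster: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return the hit with the highest probability that is not yet in the cluster."""
    candidates = {n: p for n, p in positive.items() if n not in cluster}
    return max(candidates, key=candidates.get) if candidates else None


positive = positive_quantity(first_search)
next_query = worst_weighted(positive)
print(sorted(positive))  # ['HMEA_XENLA', 'HMEN_HELTR', 'HMEN_SCHAM']
print(next_query)        # HMEA_XENLA
```

Of the six hits shown, three pass the threshold, and HMEA_XENLA has the highest probability value among them.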

The next run through the BLASTP program is then carried out with the sequence weighted as worst in this quantity, namely the engrailed-1A homeobox protein of the African clawed frog, Xenopus laevis (HMEA_XENLA). The result of this search looks as follows (excerpts):

Sequences found:                                                 Probability:
SPR|P31538|HMEB_XENLA  HOMEOBOX PROTEIN ENGRAILED-1B (EN- . . .  2.8e-36
SPR|P31537|HMEA_XENLA  HOMEOBOX PROTEIN ENGRAILED-1A (EN- . . .  3.2e-36
SPR|P09015|HME2_BRARE  HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E . . .  1.1e-34
SPR|Q05925|HME1_HUMAN  HOMEOBOX PROTEIN ENGRAILED-1 (HU-E . . .  1.3e-33
SPR|P09065|HME1_MOUSE  HOMEOBOX PROTEIN ENGRAILED-1 (MO-E . . .  1.4e-33
SPR|Q05916|HME1_CHICK  HOMEOBOX PROTEIN ENGRAILED-1 (GG-E . . .  1.5e-33
SPR|P52729|HMEC_XENLA  HOMEOBOX PROTEIN ENGRAILED-2A (EN- . . .  9.9e-33
SPR|P52730|HMED_XENLA  HOMEOBOX PROTEIN ENGRAILED-2B (EN- . . .  5.9e-32
SPR|Q05917|HME2_CHICK  HOMEOBOX PROTEIN ENGRAILED-2 (GG-E . . .  1.3e-31
SPR|P09066|HME2_MOUSE  HOMEOBOX PROTEIN ENGRAILED-2 (MO-E . . .  1.8e-31
SPR|P19622|HME2_HUMAN  HOMEOBOX PROTEIN ENGRAILED-2 (HU-E . . .  2.0e-31
SPR|Q04896|HME1_BRARE  HOMEOBOX PROTEIN ENGRAILED-1.             8.1e-31
SPR|P31535|HMEA_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE A . . .   4.3e-30
SPR|P31533|HME3_BRARE  HOMEOBOX PROTEIN ENGRAILED-3 (ZF-E . . .  6.7e-30
SPR|P31536|HMEB_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE B . . .   1.8e-28
SPR|P09532|HMEN_TRIGR  HOMEOBOX PROTEIN ENGRAILED (SU-HB- . . .  8.8e-28
SPR|P31534|HMEN_LAMPL  HOMEOBOX PROTEIN ENGRAILED-LIKE (E . . .  8.8e-28
SPR|P09075|HME6_APIME  HOMEOBOX PROTEIN E60 (FRAGMENT).          2.1e-26
SPR|P23397|HMEN_HELTR  HOMEOBOX PROTEIN HT-EN (FRAGMENT).        2.3e-26
SPR|P09076|HME3_APIME  HOMEOBOX PROTEIN E30 (FRAGMENT).          3.9e-26

We again consider all sequences having a probability lower than 10⁻³⁰ and find that, except for HMEA_MYXGL, all of them are already contained in the cluster. This sequence is now included in the cluster, and the next BLASTP search is started with it. This search yields the following result (excerpts):

Sequences found:                                                 Probability:
SPR|P31535|HMEA_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE A . . .   3.8e-36
SPR|P31534|HMEN_LAMPL  HOMEOBOX PROTEIN ENGRAILED-LIKE (E . . .  1.5e-30
SPR|P31538|HMEB_XENLA  HOMEOBOX PROTEIN ENGRAILED-1B (EN- . . .  1.8e-30
SPR|P31537|HMEA_XENLA  HOMEOBOX PROTEIN ENGRAILED-1A (EN- . . .  3.8e-30
SPR|P52729|HMEC_XENLA  HOMEOBOX PROTEIN ENGRAILED-2A (EN- . . .  4.9e-29
SPR|P09015|HME2_BRARE  HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E . . .  1.4e-28
SPR|Q05925|HME1_HUMAN  HOMEOBOX PROTEIN ENGRAILED-1 (HU-E . . .  1.7e-28
SPR|P09065|HME1_MOUSE  HOMEOBOX PROTEIN ENGRAILED-1 (MO-E . . .  1.8e-28
SPR|P09066|HME2_MOUSE  HOMEOBOX PROTEIN ENGRAILED-2 (MO-E . . .  3.1e-28
SPR|P19622|HME2_HUMAN  HOMEOBOX PROTEIN ENGRAILED-2 (HU-E . . .  3.3e-28
SPR|Q05916|HME1_CHICK  HOMEOBOX PROTEIN ENGRAILED-1 (GG-E . . .  4.6e-28
SPR|P52730|HMED_XENLA  HOMEOBOX PROTEIN ENGRAILED-2B (EN- . . .  2.1e-27
SPR|Q05917|HME2_CHICK  HOMEOBOX PROTEIN ENGRAILED-2 (GG-E . . .  2.2e-27
SPR|P09075|HME6_APIME  HOMEOBOX PROTEIN E60 (FRAGMENT).          2.9e-27
SPR|P23397|HMEN_HELTR  HOMEOBOX PROTEIN HT-EN (FRAGMENT).        4.4e-27
SPR|Q04896|HME1_BRARE  HOMEOBOX PROTEIN ENGRAILED-1.             4.9e-27
SPR|P09076|HME3_APIME  HOMEOBOX PROTEIN E30 (FRAGMENT).          5.4e-27
SPR|P31533|HME3_BRARE  HOMEOBOX PROTEIN ENGRAILED-3 (ZF-E . . .  2.0e-26
SPR|P31536|HMEB_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE B . . .   8.8e-26

This time we add HMEN_LAMPL to our cluster, and we start the next BLASTP search with this sequence, yielding the following result (excerpt):

Sequences found:                                                 Probability:
SPR|P31534|HMEN_LAMPL  HOMEOBOX PROTEIN ENGRAILED-LIKE (E . . .  5.7e-37
SPR|P31535|HMEA_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE A . . .   5.0e-31
SPR|P31538|HMEB_XENLA  HOMEOBOX PROTEIN ENGRAILED-1B (EN- . . .  1.4e-28
SPR|P31537|HMEA_XENLA  HOMEOBOX PROTEIN ENGRAILED-1A (EN- . . .  2.9e-28
SPR|P23397|HMEN_HELTR  HOMEOBOX PROTEIN HT-EN (FRAGMENT).        1.2e-27
SPR|P31536|HMEB_MYXGL  HOMEOBOX PROTEIN ENGRAILED-LIKE B . . .   1.4e-27
SPR|Q04896|HME1_BRARE  HOMEOBOX PROTEIN ENGRAILED-1.             1.5e-27
SPR|P09015|HME2_BRARE  HOMEOBOX PROTEIN ENGRAILED-2 (ZF-E . . .  6.9e-27
SPR|Q05925|HME1_HUMAN  HOMEOBOX PROTEIN ENGRAILED-1 (HU-E . . .  1.5e-26
SPR|P09065|HME1_MOUSE  HOMEOBOX PROTEIN ENGRAILED-1 (MO-E . . .  1.6e-26
SPR|P09075|HME6_APIME  HOMEOBOX PROTEIN E60 (FRAGMENT).          1.9e-26
SPR|Q05916|HME1_CHICK  HOMEOBOX PROTEIN ENGRAILED-1 (GG-E . . .  4.5e-26

Among the sequences that pass the threshold, we do not find any that are not already contained in our cluster, so the SYSTERS search for this inquiry sequence is now concluded, and the cluster contains the following 26 sequences:

HME1_HUMAN, HME1_MOUSE, HME1_CHICK, HME2_CHICK, HME2_HUMAN, HME2_MOUSE, HME2_BRARE, HMEB_XENLA, HMEC_XENLA, HMED_XENLA, HME3_BRARE, HME1_BRARE, HMEN_DROVI, HMIN_DROME, HMEN_BOMMO, HMEN_DROME, HMIN_BOMMO, HMEN_TRIGR, HMEN_ARTSF, HME3_APIME, HME6_APIME, HMEN_SCHAM, HMEN_HELTR, HMEA_XENLA, HMEA_MYXGL, HMEN_LAMPL.

If this procedure is performed for all 28 sequences annotated as homeobox engrailed in the SWISS-PROT database, this yields 28 clusters at first. The clusters thus found are plotted in the following table against the sequences, where the columns represent the clusters belonging to the inquiry sequence listed at the head of the table and the rows indicate the clusters in which the sequence listed at the left is contained (marked with an X). In this case, there are seven clusters having 27 sequences each, five clusters having 26 sequences each, etc.

Cluster (inquiry sequence): 1 (HME2_BRARE), 2 (HME2_MOUSE), 3 (HMEB_XENLA), 4 (HMEC_XENLA), 5 (HMED_XENLA), 6 (HMEN_LAMPL), 7 (HMEA_MYXGL), 8 (HMEA_XENLA)

Sequence    Clusters containing the sequence
HME2_BRARE  X X X X X X X X
HME2_MOUSE  X X X X X X X X
HMEB_XENLA  X X X X X X X X
HMEC_XENLA  X X X X X X X X
HMED_XENLA  X X X X X X X X
HMEN_LAMPL  X X X X X X X
HMEA_MYXGL  X X X X X X X X
HMEA_XENLA  X X X X X X X X
HM16_CAEEL  X X X X X X X X
HME1_MOUSE  X X X X X X X X
HMEN_DROME  X X X X X X X X
HMIN_DROME  X X X X X X X X
HME6_APIME  X X X X X X X X
HME3_APIME  X X X X X X X X
HMEN_DROVI  X X X X X X X X
HMEN_TRIGR  X X X X X X X X
HMEN_SCHAM  X X X X X X X X
HMEN_HELTR  X X X X X X X X
HMEN_BOMMO  X X X X X X X X
HME3_BRARE  X X X X X X X X
HME1_BRARE  X X X X X X X X
HMIN_BOMMO  X X X X X X X X
HMEN_ARTSF  X X X X X X X X
HME2_HUMAN  X X X X X X X X
HME1_CHICK  X X X X X X X X
HME2_CHICK  X X X X X X X X
HME1_HUMAN  X X X X X X X X
HMEB_MYXGL

Cluster (inquiry sequence): 9 (HM16_CAEEL), 10 (HME1_MOUSE), 11 (HMEN_DROME), 12 (HMIN_DROME), 13 (HME6_APIME), 14 (HME3_APIME), 15 (HMEN_DROVI), 16 (HMEN_TRIGR)

Sequence    Clusters containing the sequence
HME2_BRARE  X X X X X X X X
HME2_MOUSE  X X X X X X X X
HMEB_XENLA  X X X X X X X X
HMEC_XENLA  X X X X X X X X
HMED_XENLA  X X X X X X X X
HMEN_LAMPL
HMEA_MYXGL  X
HMEA_XENLA  X X X X X X X X
HM16_CAEEL  X X X X X X X X
HME1_MOUSE  X X X X X X X X
HMEN_DROME  X X X X X X X X
HMIN_DROME  X X X X X X X X
HME6_APIME  X X X X X X X X
HME3_APIME  X X X X X X X X
HMEN_DROVI  X X X X X X X X
HMEN_TRIGR  X X X X X X X X
HMEN_SCHAM  X X X X X X X X
HMEN_HELTR  X X X X X X X X
HMEN_BOMMO  X X X X X X X X
HME3_BRARE  X X X X X X X X
HME1_BRARE  X X X X X X X X
HMIN_BOMMO  X X X X X X X X
HMEN_ARTSF  X X X X X X X X
HME2_HUMAN  X X X X X X X X
HME1_CHICK  X X X X X X X X
HME2_CHICK  X X X X X X X X
HME1_HUMAN  X X X X X X X X
HMEB_MYXGL

Cluster (inquiry sequence): 17 (HMEN_SCHAM), 18 (HMEN_HELTR), 19 (HMEN_BOMMO), 20 (HME3_BRARE), 21 (HME1_BRARE), 22 (HMIN_BOMMO), 23 (HMEN_ARTSF)

Sequence    Clusters containing the sequence
HME2_BRARE  X X X X X X X
HME2_MOUSE  X X X X X X X
HMEB_XENLA  X X X X X X X
HMEC_XENLA  X X X X X X X
HMED_XENLA  X X X X X X X
HMEN_LAMPL
HMEA_MYXGL
HMEA_XENLA  X X X X X
HM16_CAEEL  X X X X X X
HME1_MOUSE  X X X X X X X
HMEN_DROME  X X X X X X X
HMIN_DROME  X X X X X X X
HME6_APIME  X X X X X X X
HME3_APIME  X X X X X X X
HMEN_DROVI  X X X X X X X
HMEN_TRIGR  X X X X X X X
HMEN_SCHAM  X X X X X X X
HMEN_HELTR  X X X X X X X
HMEN_BOMMO  X X X X X X X
HME3_BRARE  X X X X X X X
HME1_BRARE  X X X X X X X
HMIN_BOMMO  X X X X X X X
HMEN_ARTSF  X X X X X X X
HME2_HUMAN  X X X X X X X
HME1_CHICK  X X X X X X X
HME2_CHICK  X X X X X X X
HME1_HUMAN  X X X X X X X
HMEB_MYXGL

Cluster (inquiry sequence): 24 (HME2_HUMAN), 25 (HME1_CHICK), 26 (HME2_CHICK), 27 (HME1_HUMAN), 28 (HMEB_MYXGL)

Sequence    Clusters containing the sequence
HME2_BRARE  X X X X
HME2_MOUSE  X X X X
HMEB_XENLA  X X X X
HMEC_XENLA  X X X X
HMED_XENLA  X X X X
HMEN_LAMPL  X X X X
HMEA_MYXGL  X X X X
HMEA_XENLA  X X X X
HM16_CAEEL
HME1_MOUSE  X X X X
HMEN_DROME  X X X X
HMIN_DROME  X X X X
HME6_APIME  X X X X
HME3_APIME  X X X X
HMEN_DROVI  X X X X
HMEN_TRIGR  X X X X
HMEN_SCHAM  X X X X
HMEN_HELTR  X X X X
HMEN_BOMMO  X X X X
HME3_BRARE  X X X X
HME1_BRARE  X X X X
HMIN_BOMMO  X X X X
HMEN_ARTSF  X X X X
HME2_HUMAN  X X X X
HME1_CHICK  X X X X
HME2_CHICK  X X X X
HME1_HUMAN  X X X X
HMEB_MYXGL  X

After removing identical clusters and resolving inclusions, the homeobox engrailed proteins are distributed among two clusters: one with 27 sequences and the other with only the HMEB_MYXGL sequence.

What is claimed is:
1. A method of grouping sequences in a sequence database to provide a sequence cluster containing similar sequences, the method comprising: (a) providing an inquiry sequence; (b) determining a positive quantity of similar sequences to said inquiry sequence from the sequence database using a database search program, wherein said positive quantity of similar sequences includes all sequences in said sequence database for which the probability that similarity occurs randomly is below a predetermined threshold level; (c) selecting one sequence having the highest probability of randomly occurred similarity from said positive quantity determined for said inquiry sequence as a subsequent search sequence; (d) determining a subsequent positive quantity of similar sequences for said subsequent search sequence from the sequence database using the database search program as in (b); (e) identifying new sequences in said subsequent positive quantity that are not included in said positive quantity determined for said inquiry sequence, said new sequences being combined with said positive quantity of similar sequences of said inquiry sequence; (f) repeating steps (c) to (e) as long as new sequences are still identified that are not contained in the combined positive quantity of similar sequences of step (e) and as long as there is an intersection between sequences in the combined positive quantity and any sequences in the subsequent positive quantity; and (g) outputting all the different sequences contained in the combined positive quantity as one cluster containing similar sequences.
2. A method of clustering all sequences in a sequence database, according to the method of claim 1, further comprising: (h) repeating steps (a)-(g) for each sequence in the sequence database to obtain multiple clusters; (i) removing clusters that occur more than once except for one cluster; (j) removing clusters that are contained in other clusters; (k) of the remaining clusters, outputting those clusters that do not overlap with other clusters as partitioning of the database; and (l) outputting remaining overlapping clusters as groups whose clusters are linked together by overlapping.
3. The method of claim 1, wherein the threshold level of probability is between 10⁻²⁰ and 10⁻³⁵.
4. The method of claim 2, wherein the threshold level of probability is between 10⁻²⁰ and 10⁻³⁵.